# Video Question Answering
## LLaVA-Video-7B-Qwen2-TPO
ruili0 · MIT · Video-to-Text · Transformers · 490 downloads · 1 like

LLaVA-Video-7B-Qwen2-TPO is a video understanding model built on LLaVA-Video-7B-Qwen2 with temporal preference optimization (TPO), and performs strongly across multiple video benchmarks.
## mPLUG-Owl3-1B-241014
mPLUG · Apache-2.0 · Text-to-Image · Safetensors · English · 617 downloads · 2 likes

mPLUG-Owl3 is a multimodal large language model that targets the challenges of long image-sequence understanding; its Hyper Attention mechanism significantly improves processing speed and the sequence lengths it can handle.
## VideoChat2-HD-Stage4-Mistral-7B-hf
OpenGVLab · MIT · Video-to-Text · Safetensors · 393 downloads · 3 likes

VideoChat2-HD-hf is a multimodal video understanding model built on Mistral-7B, focused on video-to-text tasks.
## Tarsier-7b
omni-research · Video-to-Text · Transformers · 635 downloads · 23 likes

Tarsier-7b is an open-source large video-language model from the Tarsier series, specializing in generating high-quality video descriptions alongside strong general video understanding.
## CogVLM2-Video-Llama3-Chat
THUDM · Other license · Text-to-Video · Transformers · English · 2,384 downloads · 48 likes

CogVLM2-Video is a high-performance video understanding model that reaches state-of-the-art results on multiple video question-answering tasks and can complete video understanding within one minute.
## LLaVA-NeXT-Video-7B-DPO-hf
llava-hf · Video-to-Text · Transformers · English · 12.61k downloads · 9 likes

LLaVA-NeXT-Video is an open-source multimodal chatbot optimized through mixed training on video and image data, with strong video understanding capabilities.
## LLaVA-NeXT-Video-7B-hf
llava-hf · Text-to-Video · Transformers · English · 65.95k downloads · 88 likes

LLaVA-NeXT-Video is an open-source multimodal chatbot that gains its video understanding capabilities through mixed training on video and image data, reaching SOTA level among open-source models on the VideoMME benchmark.
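As a rough orientation for the Transformers-based models above, the sketch below shows the two cheap preprocessing steps a LLaVA-NeXT-Video query typically needs: uniform frame sampling and prompt construction. The model id and the `LlavaNextVideo*` classes named in the comments are real Transformers APIs, but the 8-frame clip length and the exact chat template here are illustrative assumptions, not the model card's authoritative recipe.

```python
# Hypothetical sketch of preparing a query for llava-hf/LLaVA-NeXT-Video-7B-hf.
# Frame count and prompt template are assumptions for illustration.
import numpy as np


def sample_frames(total_frames: int, num_frames: int = 8) -> list[int]:
    """Uniformly sample frame indices; video LLMs expect a fixed-length clip."""
    return np.linspace(0, total_frames - 1, num_frames, dtype=int).tolist()


def build_prompt(question: str) -> str:
    """LLaVA-style USER/ASSISTANT chat format with a <video> placeholder token."""
    return f"USER: <video>\n{question} ASSISTANT:"


if __name__ == "__main__":
    # The heavy part (downloads multi-GB weights), shown for orientation only:
    # from transformers import (LlavaNextVideoProcessor,
    #                           LlavaNextVideoForConditionalGeneration)
    # processor = LlavaNextVideoProcessor.from_pretrained(
    #     "llava-hf/LLaVA-NeXT-Video-7B-hf")
    # model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    #     "llava-hf/LLaVA-NeXT-Video-7B-hf", device_map="auto")
    print(sample_frames(120, 8))   # [0, 17, 34, 51, 68, 85, 102, 119]
    print(build_prompt("What is happening in this video?"))
```

Uniform sampling is the usual default because it covers the whole clip regardless of its length; shot-boundary-aware sampling is a common refinement.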
## git-large-msrvtt-qa
microsoft · MIT · Image-to-Text · Transformers · multiple languages · 108 downloads · 2 likes

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens, fine-tuned here for the MSRVTT-QA video question-answering task.
## git-base-msrvtt-qa
microsoft · MIT · Image-to-Text · Transformers · multiple languages · 84 downloads · 2 likes

GIT is a Transformer decoder conditioned on CLIP image tokens and text tokens for vision-language tasks, fine-tuned on MSRVTT-QA.
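The GIT checkpoints consume a short, fixed-length clip of frames rather than a whole video. The sketch below only shows the cheap input-shaping step; the checkpoint id in the comments is the real one, but the 6-frame clip length and the (T, H, W, 3) stacking convention are assumptions for illustration, not Microsoft's reference code.

```python
# Hypothetical sketch: shaping sampled frames for microsoft/git-base-msrvtt-qa.
# The 6-frame clip length is an assumption for illustration.
import numpy as np

NUM_FRAMES = 6  # assumed fixed clip length for the video QA checkpoints


def stack_frames(frames: list) -> np.ndarray:
    """Stack sampled RGB frames of shape (H, W, 3) into one (T, H, W, 3) clip."""
    if len(frames) != NUM_FRAMES:
        raise ValueError(f"expected {NUM_FRAMES} frames, got {len(frames)}")
    return np.stack(frames)


if __name__ == "__main__":
    # Heavy part, shown for orientation only:
    # from transformers import AutoProcessor, AutoModelForCausalLM
    # processor = AutoProcessor.from_pretrained("microsoft/git-base-msrvtt-qa")
    # model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-msrvtt-qa")
    clip = stack_frames([np.zeros((224, 224, 3), dtype=np.uint8)] * NUM_FRAMES)
    print(clip.shape)  # (6, 224, 224, 3)
```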